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Abstract 

In this notes we describe an algorithm for non-linear fitting which incorporates some of the 
features of linear least squares into a general minimum fit and provide a pure Python im- 
plementation of the algorithm. It consists of the variable projection method (varpro), combined 
with a Newton optimizer and stabilized using the steepest descent with an adaptative step. The 
algorithm includes a term to account for Bayesian priors. We performed tests of the algorithm 
using simulated data. This method is suitable, for example, for fitting with sums of exponentials 
as often needed in Lattice Quantum Chromodynamics. 

1 Introduction 

Fitting is one the most common recurrent tasks in scientific computing. Fitting involves minimizing 
the i-e. weighted sum of squares of the difference between observed data and expected data [1]. 
The expected data is computing using a function / which depends on some unknown parameters. 
The is minimized by varying these unknown parameters in order to find a minimum. 

Usually there are two common approaches to fitting: 

• If the function / is linear in the unknown parameters, then the fitting problem has an exact 
solution which can be computed via the linear least squares. 

• If the function / is nonlinear in its parameters, the minimization is carried out using a non- 
linear optimization algorithm based on the Newton method [3j, the Steepest Descent or 
more sophisticated techniques such as the Levenberg-Marquadt. 

There is also a non-linear least squares algorithm. It is different from the one described in this 
paper and it is equivalent to performing a Newton optimization where the first order derivatives of 
the fitting functions are computed analytically. 



When the fitting function depends of both some linear parameters and some non-linear ones, the two 
methods above can be combined. This approach, described here, is known as VARiable PROjection 
method jS]- An implemention of it which uses the Levenberg-Marquadt algorithm for the non- 
linear part can be found in Netlib [6|. The approach consists of defining a function g which takes 
as input the non-linear parameters of the original fitting function and returns the The function 
(7, internally uses the least square algorithm to compute the linear coefficients exactly (as function 
of the non- linear coefficients and the input data) and uses this result to compute the x^. In our 
implementaton the complete fit is performed by passing the function g to a non-linear Newton 
optimizer stabilized using an adaptative steepest-descent. 

One application of this procedure is, for example, in lattice QCD, where typically one has to fit 
data points using a sum of exponentials of the form ajC"'''*. and hi are the parameters to be 
determined. 

Fits like the one described become unstable when two of the h coefficients are very close to each 
other. In order to avoid this situation we modified the algorithm to account for a priori constraints 
on the fitting parameters in the form of a Bayesian contribution to the x^. This can be used to 
stabilize the fit [7]. 



2 Algorithm 

We want to fit data points of the form {xi,yi it 5yi). We will do it by minimizing the defined as 



Vi - /(xi,a,b) 



Syi 



The function / is known but depends on unknown parameters a = (ao, a-i, ■■■) and b = (ao, oi, ■ 
In terms of these parameters the function / can be written as follows: 



/(x,a,b) = ^ajfj{x,h) 



An example is the following one: 

fix, a, b) = aoe-^o" + aie"^!" + aae-''^" + 



(3) 



The goal of our algorithm is to efficiently determine the parameters a and b which minimize the 



We proceed by defining the following two quantities: 
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In terms of A and z the ^ can be rewritten as 



x'(a,b) = |A(b)a-z|^ 
We can minimize this function (in a) using the linear least squares algorithm, exactly: 

a(b) = (^(b)U(b))-i(^(b)*z) 
We define a function which returns the minimum for a fixed input b: 

g{h) = minx^(a,b) = x^(a(b),b) = |yl(b)a(b) - z|^ + Bayesian(b) 



(4) 



(5) 



(6) 



(7) 



Hence we have reduced the original problem to a simple problem by reducing the number of unknown 
parameters from Na + N^^ to Ni,. 

We have arbitrarily added a term, Bayesian to the definition of x^. This term can be set to zero, 
but also, according to Lepage et al. [7J, we can choose it as follows: 
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(8) 

where bi bi i: 6bi is the a priori knowledge (or assumption) about the expected values for the 
b parameters, according to Bayesian statistics. The Bayesian term originates from the fact that 
ordinary statistics concerns the computation of probability that data is compatible with a given 
model while here we are concerned with the opposite problem, the probability that a certain model 
(defined by the unknown parameters a and b) is compatible with given data. In practice the effect 
of the Bayesian term is that of stabilizing the fit by constraining the values within a range specified 
by a point and its uncertainty by increasing the quadratically when the fitting parameters depart 
from the priors. 

The methodology described here (excluding the Bayesian contribution) is known as VARPRO [5j 
algorithm. 



Bayesian(b) = ^ 



dbi 
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3 Code 



The complete code is reported in the the Appendix. It contains the algorithm implemented in 
Python (use suggest version 2.7) and it uses the numpy library for linear algebra. The code defines 
partial derivatives, the gradient, the Hessian, and the 1-norm of a matrix. These functions are 
necessary for the inner workings of the Newton algorithm. The code also includes a modified multi- 
dimensional Newton optimization algorithm that reverts to the steepest descent when the Newton 
step fails to reduce the value of the function passed as input. The step of the minimum descent 
is adjusted automatically: it starts with a guess and reduces the step until the is decreased by 
the move. This guarantes a decrease of the each iteration. Although it never explodes, it 

may still fail. This can happen because the target precision is not reached within the maximum 
number of allowed iterations or because the Hessian becomes singular, or because A matrix in the 
least squared fit becomes singular. 

The main logic described in this notes is implemented in the function called fit. 
The input of the function fit is: 

• data is a list of {x, y, 5y) points to be fitted. 

• f s is a list of functions fi as described in the previous section. Each function must have the 
signature lambda b,x: ... where 6 is a list of h coefficients. 

• b is a list of b values from which to start the optimization. 

• ap=le-6 is the required target precision in the output. 

• rp=le-4 is the required target relative precision. 

• ns=200 is the maximum number of steps performed by the Newton algorithm. If the Newton 
algorithm fails to converge within the required target absolute or relative precision within ns 
steps, the algorithm raises an exception. 

• bayesiaii=None can be set to a function of b whose value is be added to the in the 
intermediate steps. 

The function fit returns a tuple containing a, 6, the and the Hessian, a and h are lists. 



4 Examples 

Here we report two examples of usage of the algorithm. In the first one we consider a fitting function 
of the form 

f{x) = aox + aix'^ H —- (9) 

x + bo 
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Figure 1: Example of simulated data (100 points) and their fit with three exponentials 

We generated 100 points using the input values of oq = 1, ai = 2, 02 = 300 and bo = 10 for 
X = 0.1, 0.2, 0.3, ...10. For each point we computed y using the above formula and we added a 
random Gaussian noise with standard deviation equal to 1% of y. We set 6y = 1% for each point. 
We then fitted the data using our method: 
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>>> 


import random 
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>>> 


def noise ( i , X , r=0 . 05 ) : return (float(i) 


, X* ( 1 . 0+ random . gauss (0,r)) ,abs(x)*r) 
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>>> 


def f(x): return x+x **2+300 . 0/ ( x+10) 
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>>> 


data = [noise (0 . l*i , f (0 . l*i) ,0 . 01) for 


i in range (1 , 101) ] 
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>>> 


f s = [ ( lambda b , x : x ) , 
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( lambda b , x : x * x ) , 
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(lambda b.x: 1 . 0/ ( x+b [0] ) ) ] 
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>>> 


a , b, clii2 , H = f it (data , f s , b= [5] ) 




9 


>>> 


print a, b 
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The result shows that the fit converged within a few percent accuracy to the true values. Of course 
we do not expect the output to yield the exact same values which we used to generate the data, 
because we added the noise. The point here is that despite the noise, the fit is stable and consistent 
with expectation. 

Notice that the notation lambda b,x: x*x is nothing but the inUnc definition of an un-named 
function which takes b,x and input and produces x*x as output. Hence f s is a list containing three 



fit example 
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fit results 
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Figure 2: Parallel coordinates plot for 50 simulated experiments. The data from each simulated 
experiment is fitted and the resulting bo, 6i, 62 are shown in the plot, connected by a line. The 

density of lines indicates the most likely range of results and their correlation. The results are 
compatible with the input used to generate the simulated experiments: bo = —0.10, bi = —0.04, 
62 = -0.02. 

functions. 

In our second test we used a function which is the sum of three exponentials: 

fix) = aoe-^o^ + aie-^i^ + a2e-^^^ (10) 

We performed a scries of 50 simulated experiments. In each experiment we generated 100 data 
points X equally spaced between and 32 and computed the corresponding y's using the above 
formula with the input oq = 100, ai = 10, 02 = 1 and bo = —0.10, 61 = —0.04 and 62 = —0.02 
adding a 2% Gaussian noise and assuming a 2% error on each point. We then fit each simulated 
experiment. To improve the stability of the fit we added a Baycsian contribution to the which 
corresponds to priors 60 ~ -0.11 ± 0.04, 61 ~ -0.05 ± 0.04, and 62 ~ -0.03 ± 0.04. 

Here is the code for each simulated experiment: 

1 >>> from math import exp 

2 >>> def f(x): return 100 . 0* exp ( -0 . 10*x) +20 . 0* exp ( -0 . 04*x) +4 . 0* exp ( -0 . 02*x) 

3 >>> xs = [0.3*i for i in range (100)] 

4 >>> data = [noise (x , f (x) ,0.02) for x in xs] 
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6 (lambda b,x: exp (b [1] *x) ) , 

7 (lambda b,x: exp (b [2] *x) ) ] 

8 >>> def bayesian (b) : return ( ( b [0] +0 . 1 1 ) **2+ ( b [ 1] +0 . 05 ) **2+ ( b [2] +0 . 03) * + 2) /O . 04* *2 

9 >>> a, b, chi2 , h = f it ( data , f s , b= [ -0 . 1 1 , -0 . 05 , -0 . 03] , ns = 100 , bay es ian = bayes ian ) 

10 >>> print a, b 

11 [98.041..., 10.795..., 16.281...] , [-0.103..., -0.052..., -0.029...] 

The result of the fit is also shown in fig. [T] 

In this example the contribution of the third exponential is too small to be determined form data. 
We have ignored results that are more than two standard deviations off from priors and we have 
displayed the resulting fitting values for bo, bi and 62 in a parallel coordinates plot, fig. [2j The 
X-axis represent the i index of the bi parameters and the Y-axis the corresponding values. The 
lines connect the 6o)^i)^2 values for each simulated experiment. Higher density of lines represents 
the most likely values and their correlation. 

We found that in about 10% of our simulated experiments the algorithm failed to converge but this 
could be corrected by changing slightly the input starting point for the Newton optimizer. 

5 Conclusions 

In this notes we descibe the varpro method for non-linear fitting which uses linear least squares fitting 
to reduce the space of the parameters to be explored. We provided an implementation in Python 
which uses a stabilized Newton method the non-linear part of the fit and includes contribution from 
Bayesian priors. In the case of fitting with a sum of exponentials it reduces the size of the parameter 
space by a factor of two. We have provided working code and examples of fitting simulated data. 
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Appendix 



1 ### Developed by Massimo Di Pierro <mdipierro@cs.depaul.eclu> 

2 ### License : BSD 

3 

4 from numpy import matrix 

5 from numpy . linalg import * 
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7 def partial (f , i ,h=le-4) : 
g II II II 

9 definition of paritial derivative, df/dx_i 

10 

11 def df (x , f = f , i = i , h=h) : 

12 X [i] +=h 

13 u = f(x) 

14 x[i]- = 2*h 

15 V = f (x) 

16 x [i] +=h 

17 return (u-v)/2/h 

18 return df 

19 

20 def gradient (f, x, h=le-4) : 

21 

22 gradient of f in x 

23 " " " 

24 s = xrange ( len (x) ) 

25 return matrix ([ [partial (f , r , h) (x) ] for r in s]) 

26 

27 def hessianCf, x, h=le-4) : 

28 " " " 

29 hessian of f in x 

30 

31 s = xrange ( len (x) ) 

32 grad = [part ial (f , r , h) for r in s] 

33 return matr ix ([ [part ial ( grad [r] , r , h) (x) for c in s] for r in s] ) 

34 

35 def norm (A) : 

36 

37 defines norm of a matrix to check convergence 

38 

39 rows, cols = A. shape 

40 return max ( [ sum ( abs ( A [r , c ] ) for r in xr ange ( rows ) ) \ 

41 for c in xr ange ( cols )] ) 

42 

43 def tolist (A) : 

44 rows, cols = A . shape 

45 return [A[r,0] for r in xrange ( rows ) ] 

46 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
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def 


opt iniize_newt on_mul t i _inipor ved (f , x, ap = le-6, rp = le 
II II II 


-4 , ns =20) : 


1 


Multidimensional Newton optimizer 






on failure is performs a steepest descent 
II II II 






fx = f(x) 






X = matr ix ([[ element ] for element in x] ) 






h = 10.0 






for k in xrange(ns) : 






grad = gradient (f , tolist (x) ) 






(grad,H) = (gradient (f , tolist (x) ) , hessian (f , tolist (x) ) ) 




if norm(H) < ap : 






r&iSG Ar i t hm e t i c Er r o r , 'unstable solution' 






(fx_old, x_old , x) = (fx, X, X- (1 . 0/H) *grad) 






fx = f (tolist(x)) 






while fx>fx_old: # revert to steepest descent 






(fx, x) = (fx old, x old) 






n — norm (grad) 






(x old, x) = (x, X ~ grad/n*li) 






(fx_old, fx) = (fx, f (tolist (x) ) ) 






h = h/2 






h = norm (x - x_ old ) *2 






if k>2 and h/2 <max ( ap , norm ( x) * rp ) : 






X = tolist(x) 






return x, hessian (f , x) 






raise ArithmeticError , 'no convergence' 




def 


fit(data, fs, b, ap=le-6, rp=le-4, ns=200, bayesian 


=None ) : 




na = len(fs) 






def least _ square s (b,data = data,fs = fs) : 






A = matrix ([ [fs [k] (b , x) /dy for k in xrange(na)] 


for (x,y,dy) in data]) 




z = matrix ([ [y/dy] for (x,y,dy) in data]) 






a = inv (A . T*A) * ( A . T*z) 






chi2 = norm (A*a-z) *+2 






return a, chi2 






def g(b,data=data,fs=fs,bayesian=bayesian) : 






a, chi2 = least_squares (b , data, fs) 






if bayesian : 






chi2 += bayesian(b) 






return chi2 






b, H = optimize_neMton_multi_imporved (g ,b , ap , rp , ns) 






a, chi2 = least_squares (b , data , f s ) 
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